Telco churn analysis

This notebook performs an exploratory data analysis (EDA) on the telco churn dataset, followed by feature engineering for modelling.

There are some extra cells at the end that show how this data set can be clustered and what the results can tell us about the customers in each group.

In [2]:
# import necessary packages for the EDA

# pandas for dataframes and numpy for linear algebra and matrices
import pandas as pd
import numpy as np

#standard plotting
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns',99)


import warnings
warnings.filterwarnings('ignore')


from sklearn.preprocessing import LabelEncoder
In [3]:
#telco dataset
telco = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

Exploratory Data Analysis (EDA)

Before diving into the dataset, it's always important to look at the README file to see if there are explanations for each feature. Thankfully, this is one of the datasets that has that:

  1. customerID : Customer ID
  2. gender : Whether the customer is a male or a female
  3. SeniorCitizen : Whether the customer is a senior citizen or not (1, 0)
  4. Partner : Whether the customer has a partner or not (Yes, No)
  5. Dependents : Whether the customer has dependents or not (Yes, No)
  6. tenure : Number of months the customer has stayed with the company
  7. PhoneService : Whether the customer has a phone service or not (Yes, No)
  8. MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service)
  9. InternetService : Customer’s internet service provider (DSL, Fiber optic, No)
  10. OnlineSecurity : Whether the customer has online security or not (Yes, No, No internet service)
  11. OnlineBackup : Whether the customer has online backup or not (Yes, No, No internet service)
  12. DeviceProtection : Whether the customer has device protection or not (Yes, No, No internet service)
  13. TechSupport : Whether the customer has tech support or not (Yes, No, No internet service)
  14. StreamingTV : Whether the customer has streaming TV or not (Yes, No, No internet service)
  15. StreamingMovies : Whether the customer has streaming movies or not (Yes, No, No internet service)
  16. Contract : The contract term of the customer (Month-to-month, One year, Two year)
  17. PaperlessBilling : Whether the customer has paperless billing or not (Yes, No)
  18. PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  19. MonthlyCharges : The amount charged to the customer monthly
  20. TotalCharges : The total amount charged to the customer
  21. Churn : Whether the customer churned or not (Yes or No)

There are 21 features in total, the last being "Churn", which is binary (Yes/No) and will be used as the prediction target.

Of the remaining variables, only 3 are numerical (tenure, TotalCharges, MonthlyCharges). The other 17 are boolean (1/0, Yes/No) or categorical, such as InternetService (DSL, Fiber optic, No).

Now let's dig in

In [3]:
#data size and list of data types
print(f'Data Size: {telco.shape}')
print(f'Data Types: {telco.dtypes}')
Data Size: (7043, 21)
Data Types: customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object
In [3]:
#any nulls?
telco.isna().sum().sum()
Out[3]:
0
In [ ]:
#take a glance to understand what each data type holds
telco.head()
In [5]:
# Weirdly, the TotalCharges column is stored as an object (string); I will change that.
# It turns out TotalCharges contains blank strings (missing values), which is why pandas read it as an object.
missing_vals = list(telco['TotalCharges'].loc[telco.TotalCharges == ' '].index)

#look at the rows where missing_values are
telco.loc[missing_vals].head()
Out[5]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 No
753 3115-CZMZD Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 No
936 5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 No
1082 4367-NUYAO Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 No
1340 1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 No
In [6]:
# Impute missing TotalCharges with MonthlyCharges (reasonable, since these rows all have tenure 0)
telco.loc[missing_vals,['TotalCharges']] = telco.loc[missing_vals,['MonthlyCharges']].values
In [7]:
#make TotalCharges a numerical column
telco['TotalCharges'] = pd.to_numeric(telco['TotalCharges'])
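As an aside, the blank-string detection and conversion above can be collapsed into one step with `pd.to_numeric(errors='coerce')`. A minimal sketch on a hypothetical two-row frame (column names mirror the telco data):

```python
import pandas as pd

# Hypothetical two-row frame mimicking the raw telco columns
mini = pd.DataFrame({
    'MonthlyCharges': [29.85, 52.55],
    'TotalCharges': ['29.85', ' '],   # a blank string, as in the raw CSV
})

# errors='coerce' turns non-numeric strings into NaN in one step...
mini['TotalCharges'] = pd.to_numeric(mini['TotalCharges'], errors='coerce')
# ...which can then be filled with the monthly charge, as above
mini['TotalCharges'] = mini['TotalCharges'].fillna(mini['MonthlyCharges'])
print(mini['TotalCharges'].tolist())  # [29.85, 52.55]
```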
In [8]:
#create a list of all the categorical columns - where there are less than 10 unique values
CATEGORICAL = list(telco.columns[telco.nunique() < 10])
In [9]:
len(CATEGORICAL)
Out[9]:
17
In [14]:
#create subplots of all categorical features
fig, axs = plt.subplots(3,6, figsize=(20, 15))

data = telco[CATEGORICAL]
for ax, d in zip(axs.ravel(), data):
    sns.countplot(x=data[d], ax=ax)
    ax.set_title(str(d))
    for tick in ax.get_xticklabels():
        tick.set_rotation(90)
plt.tight_layout()
plt.savefig('catcounts.png')
plt.show()

What stands out from these plots?

  1. The target - Churn - is imbalanced: well under 50% of customers churn.
  2. Customers with and without multiple lines are roughly balanced.
  3. Fewer than 20% of the customers are seniors.
  4. The majority of telco customers have no dependents but do have a partner (fewer families, more couples).
  5. Month-to-month is the preferred contract type.
  6. Most customers have an internet connection but don't have protection or other internet add-on services.
  7. Virtually all customers have a phone plan.
  8. Paperless billing and electronic check are the preferred billing and payment methods, respectively.

This is very important information that will help with feature engineering to reduce the dimensionality of the features.

In [15]:
#what can we learn about the numerical columns
telco.describe()
Out[15]:
SeniorCitizen tenure MonthlyCharges TotalCharges
count 7043.000000 7043.000000 7043.000000 7043.000000
mean 0.162147 32.371149 64.761692 2279.798992
std 0.368612 24.559481 30.090047 2266.730170
min 0.000000 0.000000 18.250000 18.800000
25% 0.000000 9.000000 35.500000 398.550000
50% 0.000000 29.000000 70.350000 1394.550000
75% 0.000000 55.000000 89.850000 3786.600000
max 1.000000 72.000000 118.750000 8684.800000

The longest-tenured customer has been with telco for 72 months, and the highest monthly/total charges are $118.75/$8,684.80 respectively. Note that the total charge accumulates over the customer's tenure with telco.

The high standard deviation in TotalCharges suggests there are not as many long-tenured customers.

Let's look at the relationship between these 3 numerical columns.

In [17]:
#Scatter plot of monthly v total charges, points are colorized by Churn (no=blue, yes=orange)
plt.figure()
sns.scatterplot(x='MonthlyCharges', y='TotalCharges', hue='Churn', data=telco)
plt.title('Monthly v Total')
plt.show()

The above plot reveals something important about churn: total charges matter less for attrition than monthly charges. From this plot it's easy to infer that a high monthly charge is likely to lead to churn.

In [25]:
#let's look at the distributions
#(one figure, one show() call - calling plt.show() after each subplot would split the grid into separate figures)
plt.figure(figsize=(8, 10))

#monthly hist
plt.subplot(3,1,1)
sns.distplot(telco['MonthlyCharges'],color='red',norm_hist=True)
plt.title('Monthly Charge Distribution')

#total hist
plt.subplot(3,1,2)
sns.distplot(telco['TotalCharges'],color='navy',norm_hist=True)
plt.title('Total Charge Distribution')

#tenure
plt.subplot(3,1,3)
sns.distplot(telco['tenure'],color='green',norm_hist=True)
plt.title('Tenure Distribution')

plt.tight_layout()
plt.show()

These distributions are quite informative.

From the top, the monthly charges look closest to normally distributed of the three. A significant portion of customers, though, pay around $20, which gives the distribution a second mode.

The total charges tell a different story: the majority of customers have below $2,000 in total charges. There could be many reasons for this, but we will only know for sure after further analysis. For now, a guess would be that most customers have low tenure, or that tenure is not as significant a factor in how much a customer's account has been charged. What we do know is that total charges are right-skewed (long-tailed).

The final distribution, customer tenure, is interesting. It is bimodal, with two significant modes. It seems telco has a lot of new customers, but it also holds a healthy portion of loyal customers at around 72 months.
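The skew described above can be quantified with pandas' `skew()`. A quick sketch on synthetic stand-ins (not the telco columns themselves): a roughly symmetric sample scores near 0, while a long-tailed one scores clearly positive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a roughly symmetric sample vs. a long-tailed one
symmetric = pd.Series(rng.normal(65, 30, 5000))
long_tail = pd.Series(rng.exponential(2000, 5000))

# skew() is ~0 for symmetric data and clearly positive for a right-skewed tail
print(symmetric.skew(), long_tail.skew())
```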

I will now do some exploratory analysis on the relationship between all the variables and churn

In [389]:
# function to create a churn-split KDE plot for a numerical column
def kdeplot(feature,df):
    plt.figure(figsize=(9, 4))
    plt.title("KDE for {}".format(feature))
    sns.kdeplot(df[df['Churn'] == 'No'][feature].dropna(), color= 'navy', label= 'Churn: No')
    sns.kdeplot(df[df['Churn'] == 'Yes'][feature].dropna(), color= 'orange', label= 'Churn: Yes')

kdeplot('tenure',telco)
kdeplot('MonthlyCharges',telco)
kdeplot('TotalCharges',telco)

What we can infer from the above plots is that customers with a short tenure (new customers) are more likely to churn. Maybe they joined on a plan that was only available for a short time?

We also see that churn increases as monthly charges go up. Total charges give less actionable information because the Yes/No modes largely overlap.

In [10]:
#FIRST SAVE - from here on, df is used for ease of typing and because changes will be made
df = telco.copy()
In [60]:
# facet grid: churn rate for customers with/without partners, split into two grids by SeniorCitizen
# (Churn is mapped to 0/1 so barplot can average it into a churn rate)
g = sns.FacetGrid(df.assign(Churn=df['Churn'].map({'Yes': 1, 'No': 0})), col="SeniorCitizen", height=4, aspect=.9)
ax = g.map(sns.barplot, "Partner", "Churn", palette="pastel", order=['Yes', 'No'])

This plot is very insightful. Two things I can infer here: both senior citizens and younger age groups tend not to have a partner, and customers without partners churn at a higher rate.

I believe there may be a relationship between age/families/churn. From this graph alone, it's clear that senior citizens are more likely to contribute to the churn rate. I will plot some more facets using the SeniorCitizen feature as the grid.

In [384]:
#a function to quickly plot facet grids of different columns
#(Churn is mapped to 0/1 so barplot can average it into a churn rate)
def create_facet(col1,col2,col3,df):
    plot_df = df.assign(Churn=df['Churn'].map({'Yes': 1, 'No': 0}))
    g = sns.FacetGrid(plot_df, col=col1, height=3, aspect=.9)
    g.map(sns.barplot, col2, col3, palette="Blues_d", order=list(df[col2].unique()))
    plt.savefig(f'saved_figs/{col2}_churn.png')
In [65]:
create_facet('SeniorCitizen','Dependents','Churn',df)

This next plot further supports my view on seniors and churn, and gives me a basic idea of some relationships between the dataset and churn (age/family/tenure).

In [309]:
create_facet('SeniorCitizen','PhoneService','Churn',df)
In [310]:
create_facet('SeniorCitizen','PaperlessBilling','Churn',df)
I think it is safe to say that there is definitely a relationship between age and churn
In [327]:
for i in CATEGORICAL:
    create_facet('SeniorCitizen',i,'Churn',df)
    plt.title(i,loc='center', fontdict={'fontsize': 15})
    plt.show()

While this is a lot to go through, thankfully I saved the figures independently and looked at them separately.

Here's what can be highlighted and possibly acted on:

  1. Families are less likely to churn (of those that do, more analysis can be done)
  2. Customers with fiber optic internet service have the highest churn rate at telco (possible price competition?)
  3. Customers that do not have online security have the highest churn rate. While the majority of customers don't have it, it is worth highlighting and possibly creating a promotion to provide it at a discount(?)
  4. Online backup is another one that, again, is not as noteworthy but could be acted on.
  5. Device protection plays a significant role for users that churn. Pushing the protection plan to customers that do not have it could curb this phenomenon.
  6. Customers that do not have tech support are more likely to churn. This is definitely something to highlight. Do customers not know how to contact tech support? Are they unhappy with it?
  7. Month-to-month (mTm) contracts have the highest churn rate. This is likely related to the relationship between tenure and churn. Customers may be jumping between providers and prefer mTm. Worth looking at.
  8. Finally, seniors are more likely to churn. Worth trying to understand why. Maybe a campaign to contact these customers to see how they feel about the service(?)
In [409]:
def create_box_subplots(col1,df,col_list):
    fig, axs = plt.subplots(6,3, figsize=(20,30))

    for ax, d in zip(axs.ravel(), df[col_list]):
        sns.boxplot(x=df[d],y=df[col1],hue=df['Churn'],ax=ax,whis='range',dodge=True)
        sns.swarmplot(x=df[d],y=df[col1],hue=df['Churn'],ax=ax,dodge=True,size=1,edgecolor='k',color='black')
        ax.set_title(str(d))
        for tick in ax.get_xticklabels():
            tick.set_rotation(90)
        ax.legend(loc='lower right')
    plt.tight_layout()
    plt.savefig(f'{col1}vCATEG.png')
    plt.show()
    

Below are boxplots overlaid with a swarm plot. Each plot shows how tenure, MonthlyCharges and TotalCharges are distributed over all the categories in the dataset.

The swarm plot shows how many actual data points fall inside each category, which the boxplot alone cannot show. Each category's values are split by Churn: the blue boxes show the customers within that category that do not churn, and the orange boxes show the customers that do.

The box itself spans the interquartile range, excluding outliers. These are vertical boxplots, so the top of the box is the 3rd quartile, the line inside the box is the 2nd quartile (the median), and the bottom of the box is the 1st quartile. The box's height therefore reflects the spread of values, not the number of points; the overlaid swarm fills that gap, since its width at a given value (tenure, monthly charge, total charge) shows how many points sit there.
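As a concrete reminder of what the box edges correspond to, here are the three quartiles of a small tenure-like sample (the nine values are hypothetical):

```python
import numpy as np

# Nine hypothetical tenure values
vals = np.array([1, 2, 4, 9, 29, 55, 60, 70, 72])

# bottom of the box, the median line, and the top of the box
q1, q2, q3 = np.percentile(vals, [25, 50, 75])
print(q1, q2, q3)  # 4.0 29.0 60.0
```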

In [412]:
create_box_subplots('tenure',df,CATEGORICAL)
In [409]:
create_box_subplots('MonthlyCharges',df,CATEGORICAL)
In [410]:
create_box_subplots('TotalCharges',df,CATEGORICAL)

The box plots above relate tenure/monthly charges/total charges to each category. The overall consensus is that customers that are short-tenured and have high monthly charges tend to churn. Total charges are not as easy to generalise about, but it is still clear that low total charges are a feature of clients that churn.

Highlighted Findings
  1. Seniors are split almost 50/50 between clients that do and do not churn.
  2. Clients without phone service tend to be more likely to churn; multiple-line users have the lowest attrition.
  3. Fiber optic internet service gives the best view into churn; the feature can be reduced to a boolean (Yes/No) for fiber optic users.
  4. Protection services (online backup, online security, device protection, tech support) are a good indicator of churn for customers without any of them.
  5. Streaming services also provide a lot of insight into attrition and should be considered in the model.
  6. Month-to-month clients are the most significant portion of clients that churn.
  7. Clients that request paperless billing and pay by electronic check have the highest churn among those categories.

Feature Engineering

From all our findings, the following are the features that I decided to create for use in my final model:

  1. CUSTOMER ENGAGEMENT | boolean (customerengage): Month-to-month (m2m) contracts are a strong indicator of churn based on our analysis. I grouped the contract feature as either m2m or not, and called it "customer engagement" because m2m contracts are short-tenured and have high churn.
  2. YOUNG FAMILY | boolean (youngfam): A grouping of 3 features: SeniorCitizen, Partner, Dependents. The analysis showed that customers who have partners and dependents and are not seniors tend to churn less.
  3. PAPER ELECTRONIC | boolean (paperelect): Based on the finding that customers who choose paperless billing and pay through electronic check have high churn.
  4. FIBER OPTIC | boolean (fiberint): To reduce internet service to a boolean, I recoded it to flag fiber optic internet, which is where the majority of churn occurs.
  5. PROTECTION PLANS | boolean (noprotection): Customers who do not choose any type of protection (online backup, online security, device protection or tech support) tend to have high churn.
  6. TOTAL SERVICE PACKAGE | integer (totalservices): How many services each customer has, including phone service, internet service, online backup, device protection, tech support and streaming TV/movies.
  7. CUSTOMER SCORING | float (customervalue): A value derived from the other generated features. It sums customer engagement and young family (the type of customers we want) and subtracts the sum of the other boolean flags, grouping customers by their value to the organization.
In [11]:
#saved as checkpoint to return to

#saved
#save2 = df.copy()

#to return to checkpoint
#df = save2.copy()
In [12]:
# creating the features

#customer engagement
df.loc[:,'customerengage']=1
df.loc[(df['Contract']=='Month-to-month'),'customerengage']=0

#young family
df.loc[:,'youngfam']=0
df.loc[(df['SeniorCitizen']==0) & (df['Partner']=='Yes') & (df['Dependents']=='Yes'),'youngfam']=1

#paper electronic
df.loc[:,'paperelect']=0
df.loc[(df['PaperlessBilling']=='Yes') & (df['PaymentMethod']=='Electronic check'),'paperelect'] = 1

#fiberoptic
df.loc[:,'fiberint']=0
df.loc[df['InternetService']=='Fiber optic','fiberint']=1


#no protection plans
df.loc[:,'noprotection']=0
df.loc[(df['OnlineBackup']=='No') & (df['OnlineSecurity']=='No') & 
       (df['DeviceProtection']=='No') & (df['TechSupport']=='No'),'noprotection']=1

#total services
df['TotalServices'] = (df[['PhoneService', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                               'DeviceProtection', 'TechSupport', 'StreamingTV', 
                               'StreamingMovies']]== 'Yes').sum(axis=1)
#customer value
df['customervalue'] = (df['customerengage'] +df['youngfam'])-(df['paperelect']+df['fiberint']+ df['noprotection'])
In [218]:
#drop a few features that are now folded into the engineered features, as well as those
#for which the EDA gave no evidence of a pattern that could help in predicting Churn
dfnew = df.drop(columns = ['customerID','Contract','Partner','gender',
                   'MultipleLines'])
In [219]:
dfnew.shape #removing those columns keeps me at 23 features, prior to encoding 
Out[219]:
(7043, 23)
In [220]:
dfnew.head()
Out[220]:
SeniorCitizen Dependents tenure PhoneService InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn customerengage youngfam paperelect fiberint noprotection TotalServices customervalue
0 0 No 1 No DSL No Yes No No No No Yes Electronic check 29.85 29.85 No 0 0 1 0 0 1 -1
1 0 No 34 Yes DSL Yes No Yes No No No No Mailed check 56.95 1889.50 No 1 0 0 0 0 3 1
2 0 No 2 Yes DSL Yes Yes No No No No Yes Mailed check 53.85 108.15 Yes 0 0 0 0 0 3 0
3 0 No 45 No DSL Yes No Yes Yes No No No Bank transfer (automatic) 42.30 1840.75 No 1 0 0 0 0 3 1
4 0 No 2 Yes Fiber optic No No No No No No Yes Electronic check 70.70 151.65 Yes 0 0 1 1 1 1 -3
Encoding:
In [221]:
#create a boolean list
BOOLEAN = list(dfnew.columns[dfnew.nunique() <= 2])
#use lambda functionality in pandas to replace all Yes/No to 1/0 in one pythonic string
dfnew[BOOLEAN] = dfnew[BOOLEAN].apply( lambda x: x.replace('Yes',1).replace('No',0))
In [222]:
#multi-val categorical
MULTICAT = list(dfnew.columns[(dfnew.nunique() >2)&(dfnew.nunique() <=10)])
In [223]:
# I won't one hot encode Total services but I will for customer value
dfnew[MULTICAT].head(2)
Out[223]:
InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaymentMethod TotalServices customervalue
0 DSL No Yes No No No No Electronic check 1 -1
1 DSL Yes No Yes No No No Mailed check 3 1
In [224]:
#label encode multi-category variables
labeler = LabelEncoder()
for i in MULTICAT:
    dfnew[i] = labeler.fit_transform(dfnew[i])
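One caveat worth noting: `LabelEncoder` assigns integer codes in sorted (alphabetical) order, which imposes an arbitrary ordering on nominal categories such as InternetService. A quick illustration:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['DSL', 'Fiber optic', 'No', 'DSL'])

print(list(le.classes_))  # ['DSL', 'Fiber optic', 'No'] - alphabetical
print(list(codes))        # [0, 1, 2, 0]
```

Tree-based models such as the random forest used later handle these arbitrary codes fine; for linear models, one-hot encoding would be the safer choice.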
In [225]:
#remove the target variable before creating correlation to evaluate only the independent variables (features)
corr = dfnew.drop(columns = ['Churn'])
In [226]:
#create a correlation dataframe
corr = corr.corr()
#for labelling the plot
CORR_COLS = list(corr.columns)
#convert to a numpy array (for plotly)
#carray = np.array(corr)
In [227]:
corr.values.shape
Out[227]:
(22, 22)
In [228]:
#create a heatmap for the correlation matrix

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # plain bool: np.bool is removed in newer NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))


# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
corrplot = sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},xticklabels=True,yticklabels=True)
plt.title('Heatmap of correlations',{'fontsize': 30,
        'fontweight' :30})
plt.xlabel('Features')
plt.ylabel('Features')
plt.savefig('correlations.png', pad_inches=0.5,bbox_inches='tight')
plt.show()

The heatmap is a great way to visualize correlations between independent variables. From the heatmap above, we can already see some collinearity among the variables, which we do not want when training the model.

What is multicolinearity and why is it bad?

Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.
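A tiny synthetic illustration of how pairwise correlation flags a nearly collinear pair (the variable names here are made up, not telco features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                         # independent of x1
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# pairwise |correlation| flags the collinear pair but not the independent one
corr = X.corr().abs()
print(corr.loc['x1', 'x2'] > 0.9, corr.loc['x1', 'x3'] > 0.9)  # True False
```

The same thresholding idea drives the drop logic in the next cell.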

In [229]:
#Threshold for removing correlated variables
threshold = 0.9

# Absolute value correlation matrix
corr_matrix = corr.abs()
corr_matrix.head()

# lower triangle of correlations - same as the correlation matrix above
lower = corr_matrix.where(np.tril(np.ones(corr_matrix.shape), k=-1).astype(bool))  # plain bool: np.bool is removed in newer NumPy
lower

# Select columns with correlations above threshold
to_drop = [column for column in lower.columns if any(lower[column] >= threshold)]
print('There are %d columns to remove :' % (len(to_drop)))
to_drop
There are 0 columns to remove :
Out[229]:
[]
In [27]:
#import mutual info score  
from sklearn.metrics import adjusted_mutual_info_score



#def X,Y (y - what we want to predict and X is the data to be used)
Y = dfnew['Churn']
X = dfnew.drop(columns=['Churn'])

#collect scores for all columns in X to Y
info_scores = []
for i in X.columns:
    info_score = adjusted_mutual_info_score(X[i],Y)
    info_scores.append(info_score)

IS = pd.DataFrame(info_scores,index=X.columns.tolist()).reset_index()

IS.columns = ['feature','score']

IS = IS.sort_values('score',ascending=False)

plt.figure(figsize = (10,5))
sns.barplot(x='feature',y='score',data=IS)
plt.title('Mutual info scores')
plt.xticks(rotation=90)
plt.show()

Mutual info scores are not always used, but in this case they give me insight into how much information is shared between the selected features and the target, Churn.

The goal is to have all the selected independent variables meet a 'minimum' score to get the best training model.

Here I can see that total charges and phone service can be removed from the data.
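As a sanity check on what the score means, `adjusted_mutual_info_score` returns 1 for a feature identical to the target and roughly 0 for an unrelated one. A sketch on toy labels (not the telco data):

```python
from sklearn.metrics import adjusted_mutual_info_score

target    = [0, 1, 0, 1, 0, 1, 0, 1]
copy_col  = [0, 1, 0, 1, 0, 1, 0, 1]   # identical to the target
unrelated = [0, 0, 1, 1, 0, 0, 1, 1]   # shares no information with it

print(adjusted_mutual_info_score(copy_col, target))   # 1.0
print(adjusted_mutual_info_score(unrelated, target))  # ~0
```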

Prior to doing that I will fit a random forest model to check for feature importances.

In [28]:
#using a RandomForest classifier, I will train a model for early evaluation of feature importances
from sklearn.ensemble import RandomForestClassifier
In [29]:
#X,Y set already from above - will re do the code
Y = dfnew['Churn']
X = dfnew.drop(columns=['Churn'])

#with defaults because all I need are feature importances
eval_mod = RandomForestClassifier()
#fit model
eval_mod.fit(X,Y)

#feature importances
FI = eval_mod.feature_importances_
In [30]:
#put scores in a dataframe with the column names to sort and plot
FI = pd.DataFrame(FI,index=X.columns.tolist())
FI = FI.sort_values(0,ascending=False)
FI = FI.reset_index()
FI.columns = ['features','scores']
FI = FI.sort_values('scores',ascending=False)
FI = FI.reset_index()
In [31]:
#plot the feature importances
plt.figure(figsize=(12,5))
sns.barplot(x='features', y='scores', data=FI)
plt.xticks(rotation=90)
plt.title('Feature Importances')
plt.show()
In [32]:
#evaluate the RF score to decide whether to drop columns or choose a better model
eval_mod.score(X,Y) #~98% on the data the model was trained on suggests overfitting; removing variables will be necessary
Out[32]:
0.9792701973590799
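The near-perfect score above is computed on the same rows the forest was trained on, so it mostly reflects memorization. A sketch on synthetic data (not the telco frame) of how a held-out estimate differs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# synthetic: label depends on the first feature, plus noise
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = RandomForestClassifier(random_state=0)
model.fit(X, y)

train_acc = model.score(X, y)                       # in-sample: near 1.0
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # held-out: noticeably lower
print(train_acc > cv_acc)  # True
```

This is why the modelling notebook should rely on a train/test split or cross-validation rather than in-sample accuracy.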

The two plots give opposite readings on the relative necessity of TotalCharges, but PhoneService has low value in both, so I will drop it and save the dataframe for modelling.

In [230]:
# THIS IS FOR CLUSTERING  - SAVE BEFORE DROPPING
dfnew.to_csv('modelready.csv',index=False)

#drop column phone service
dfnew = dfnew.drop(columns = ['PhoneService'])

READY TO MODEL -> MODEL FOR CHURN IN NEXT NOTEBOOK

In [1]:
# MODEL DATA  -  > WILL USE THIS FOR TRAINING THE MODEL
dfnew.to_csv('modeldata.csv',index=False)

CLUSTERING - ADDITIONAL ANALYSIS

Executive Summary of Clusters

Characteristics of Group 1:
  1. Youngest group (almost no seniors within the group), lots of young families (good)
  2. Mid tenured, ~29 months
  3. ~33% request paperless billing; only ~3% use both paperless billing and electronic check payment
  4. Highest customer value, 3.83
  5. On average customers have ~2 services with telco
  6. The lowest total and monthly charges of the 3 groups (likely because of the lower total services)
  7. Lowest churn, ~8%

Recommendation: This is a young group that may have just joined telco. It would be valuable to find out what made them join and how we can retain them. While they only use 1-2 services, they are long-term clients with potential for growth.

Characteristics of Group 2:
  1. Middle-aged group with, yet again, lots of young families
  2. Excellent customer value, 3.27
  3. Longest-tenured customers, ~55 months
  4. Majority use paperless billing, but they do not also use electronic checks
  5. Use the most services, ~5 on average
  6. On average the highest total charges as well as monthly charges
  7. Highlight - customer engagement is 77%
  8. Churn is low among the group, though at ~13% it is above Group 1's rate

Recommendation: I would say that this group is truly the 'best' customer. Even though Group 1 has a higher customer value and lower churn, those customers are not as long-tenured. Group 2 customers also bring in the most revenue and are engaged with the company, so a plan to ensure they re-sign their contracts would be priority 1.

Characteristics of Group 3:
  1. Oldest group, with very few families
  2. Shortest tenure group
  3. Majority have fiber internet
  4. Most likely to use paper billing and electronic check
  5. Least likely to pay for any protection plan (bad)
  6. Very low customer engagement
  7. Highest churn, 58%

Recommendation: This group seems to include a lot of senior customers. Through our EDA we saw that seniors are likely to churn. Some evaluation is needed here, because there are also a lot of single/non-family customers in this group.

Based on these characteristics, it's clear that Group 3 is a problem and that those particular characteristics will likely lead to churn. Clustering is very helpful when possible, especially on a dataset that makes the clusters easy to interpret.

The findings from the clusters can be used to start strategic planning for marketing or client-retention teams to find out more about the people within Group 3.

Complete Cluster Analysis below

Reasons for clustering:

I decided to cluster the customers to see if it could help characterize the customers that churn. The chosen model should give insight into the characteristics of customers that are likely to churn, which should help with strategy and marketing decisions.

What you will see is a KMeans clustering of the data. The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. To put it simply, the model is best when it is able to have an inertia closest to 0.

Since this is an unsupervised modelling method, there is no scoring. However, we can visualize/interpret the groupings created to see if there are any similarities to be gleaned.
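Inertia can be computed by hand to make the definition concrete. A sketch on four toy 1-D points assigned to two clusters (values are made up):

```python
import numpy as np

# four toy points, assigned to two clusters with centers 1.5 and 10.5
points  = np.array([1.0, 2.0, 10.0, 11.0])
labels  = np.array([0, 0, 1, 1])
centers = np.array([1.5, 10.5])

# inertia = within-cluster sum of squared distances to the assigned center
inertia = ((points - centers[labels]) ** 2).sum()
print(inertia)  # 1.0
```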

In [176]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


#Y = dfnew['Churn']
X = dfnew

#scale the data - important because there are many binary features along with numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#KMeans -evaluating inertia for cluster selection - after processing data
kinert = []

#Running a for loop over 15 values of k, assigning it to the n_clusters paramter of KMeans 
for k in range(1,15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X_scaled)
    
    #append inertia to list to then plot
    kinert.append(kmeans.inertia_)


plt.figure(figsize=(15,5))
plt.plot(range(1,15),kinert)
plt.xlabel("number of k (cluster) value")
plt.ylabel("Inertia")
plt.show()
What does the inertia show/why does it matter?

The algorithm tries to drive the inertia towards 0, and the 'elbow' in the curve marks the point where adding clusters stops paying off. Looking at the line, I can see noticeable elbows at 2, 3 and 5 clusters. The clearest is at 3 clusters, which makes me believe the model is most appropriately fit with 3 clusters. However, I will try more than one before making a decision.
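As a complementary, more quantitative check than eyeballing the elbow, silhouette scores peak (rather than monotonically decrease) at a well-separated k. A minimal sketch on synthetic blobs, since the scaled telco matrix isn't reproduced here:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for the scaled customer matrix: 3 well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the silhouette peaks at the true blob count here
```

On the real scaled data, the k where the silhouette peaks can be compared against the elbow candidates (2, 3, 5) before committing to one.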

Choosing the right cluster:

We don't use 6, 8 or 10 clusters because, at the end of the day, the clustering needs to make sense for the dataset. Since this is an unsupervised learning model, there is no 'right answer' to evaluate against. So when we decide on a cluster count, we want a value that gives us meaningful clusters that can then feed other decision-making strategies.

Decision on clusters:

I will create 3 models, with 2, 3 and 5 clusters.

I will evaluate the three clusterings using plots, choose the result I believe gives the best interpretation of the customers, and finally describe the characteristics of the groups.

In [177]:
#pipelining - to scale and model in one step
from sklearn.pipeline import make_pipeline
In [178]:
#scale then cluster inside one pipeline (no dimensionality-reduction step is actually added; all defaults are kept)
#create random state to ensure values stay the same across runs - or else all interpretations will be lost
randstate = 42

scaler = StandardScaler()
modeler2 = KMeans(n_clusters=2,random_state=randstate)
modeler3 = KMeans(n_clusters=3,random_state=randstate)
modeler5 = KMeans(n_clusters=5,random_state=randstate)

#3 pipes for different clusters
pipe2 = make_pipeline(scaler,modeler2)
pipe3 = make_pipeline(scaler,modeler3)
pipe5 = make_pipeline(scaler,modeler5)

#fit and predict 
pred_2 = pipe2.fit_predict(X)
pred_3 = pipe3.fit_predict(X)
pred_5 = pipe5.fit_predict(X)
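Before plotting, it's worth sanity-checking how many customers land in each cluster, since a tiny cluster usually signals outliers rather than a customer type. A sketch on stand-in data (the notebook's own labels live in pred_2/pred_3/pred_5):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# stand-in for the scaled customer matrix
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = make_pipeline(StandardScaler(),
                       KMeans(n_clusters=3, n_init=10, random_state=42)).fit_predict(X_demo)

# cluster sizes: labels are 0..k-1, so bincount gives one count per cluster
sizes = np.bincount(labels)
print(sizes, sizes.sum())
```

In the notebook, `np.bincount(pred_3)` (or `pd.Series(pred_3).value_counts()`) would serve the same purpose.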
In [179]:
def cluster_scatter(feature1,feature2,feature3,pred_n):
    plt.figure(figsize=(10,5))
    
    colours = sns.color_palette(palette='nipy_spectral',n_colors =len(np.unique(pred_n)),desat=0.9) 

    
    
    plt.subplot(2,2,1)
    sns.scatterplot(x=X[feature1],y=X[feature2],hue=pred_n,alpha=0.5,palette=colours)
    plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1.05))
    plt.title(f'{feature1} v {feature2} CLUSTERS')


    plt.subplot(2,2,2)
    sns.scatterplot(x=X[feature1],y=X[feature2],hue=Y,alpha=0.5,palette = 'Pastel1')
    plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.05))
    plt.title(f'{feature1} v {feature2} CHURN')
    
    plt.subplot(2,2,3)
    sns.scatterplot(x=X[feature1],y=X[feature3],hue=pred_n,alpha=0.5,palette=colours)
    plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1.05))
    plt.title(f'{feature1} v {feature3} CLUSTERS')
    

    plt.subplot(2,2,4)
    sns.scatterplot(x=X[feature1],y=X[feature3],hue=Y,alpha=0.5,palette='Pastel1')
    plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.05))
    plt.title(f'{feature1} v {feature3} CHURN')
    
    
    plt.tight_layout()
    
    plt.show()

Plotting clusters to understand the differences

Below are 3 sets of plots showing the same scatterplots coloured two ways: on the left the points are coloured by cluster, and on the right by Churn.

In [180]:
# 2 Groups
cluster_scatter('tenure','MonthlyCharges','TotalCharges',pred_2)
In [181]:
# 3 Groups
cluster_scatter('tenure','MonthlyCharges','TotalCharges',pred_3)
In [182]:
# 5 Groups
cluster_scatter('tenure','MonthlyCharges','TotalCharges',pred_5)

The 3 models show variations of groupings that tell different stories.

To better understand them I will analyze each one individually and at the end make a decision on which I believe gives the most information on customer types.

Create plotting variables to reduce lines of code and make plotting easier.

In [183]:
#define plots


def box_cluster(hueval,yval,df,col_list):
    fig, axs = plt.subplots(5,4, figsize=(20,30))
    
    for ax, d in zip(axs.ravel(), df[col_list]):
        sns.stripplot(x=df[d],y=yval,data=df,hue=hueval,size=0.9,ax=ax)
        sns.violinplot(x=df[d],y=yval,data=df,hue=hueval,ax=ax) #use the df argument, not a hardcoded frame
        ax.set_title(str(d))
        for tick in ax.get_xticklabels():
            tick.set_rotation(90)
        ax.legend(loc='lower right')

    plt.tight_layout()
    plt.show()
    
#for viewing different distributions 
def group_distribution(columnname):
    sns.set_palette(palette='nipy_spectral')
        
    #create a normal dist (overall mean/std of the full data) to overlay
    normald = np.random.normal(loc=clust[columnname].mean(),scale=clust[columnname].std(),size=len(clust[columnname]))

    #figure
    plt.figure(figsize= (8,4))
        
    if len(clust.clusters.value_counts()) == 2:
    
        sns.distplot(g1[columnname],kde=True,label='Group1',hist=False,
                     kde_kws={'shade':True,'bw':'silverman','kernel':'gau'})
        sns.distplot(g2[columnname],kde=True,label='Group2',hist=False,
                     kde_kws={'shade':True,'bw':'silverman','kernel':'gau'})
    
        sns.kdeplot(normald,clip=(0,max(g1[columnname])),shade=False,label='Overall Mean without Groups')
        plt.ylabel('Probability distribution frequency')
        plt.xlabel(f'{columnname} values')
        plt.title(f'Distribution of {columnname} for each group')
        
        plt.legend()
        plt.show()
        
    elif len(clust.clusters.value_counts()) == 3:
        #3 clusters
        sns.kdeplot(g1[columnname],shade=True,label='Group1',kernel='gau',bw='silverman')
        sns.kdeplot(g2[columnname],shade=True,label='Group2',kernel='gau',bw='silverman')
        sns.kdeplot(g3[columnname],shade=True,label='Group3',kernel='gau',bw='silverman')
        sns.kdeplot(normald,clip=(0,max(g1[columnname])),shade=False,label='Overall Mean without Groups')
        plt.ylabel('Probability distribution frequency')
        plt.xlabel(f'{columnname} values')
        plt.title(f'Distribution of {columnname} for each group')
        plt.legend()
        plt.show()
    
    else:
        #5 clusters
        sns.kdeplot(g1[columnname],shade=True,label='Group1',kernel='gau',bw='silverman')
        sns.kdeplot(g2[columnname],shade=True,label='Group2',kernel='gau',bw='silverman')
        sns.kdeplot(g3[columnname],shade=True,label='Group3',kernel='gau',bw='silverman')
        sns.kdeplot(g4[columnname],shade=True,label='Group4',kernel='gau',bw='silverman')
        sns.kdeplot(g5[columnname],shade=True,label='Group5',kernel='gau',bw='silverman')
        sns.kdeplot(normald,clip=(0,max(g1[columnname])),shade=False,label='Overall Mean without Groups')
        
        plt.ylabel('Probability distribution frequency')
        plt.xlabel(f'{columnname} values')
        plt.title(f'Distribution of {columnname} for each group')
        plt.legend()
        
        plt.show()
In [184]:
def boul_dist(columnname):
    sns.set_palette(palette='nipy_spectral')   
    #figure
    plt.figure(figsize= (8,4))
    #2 clusters - boolean features
    if len(clust.clusters.value_counts()) == 2:

        sns.kdeplot(g1[columnname],label='Group1',shade=True,bw=0.3,kernel='biw',
                    gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g2[columnname],label='Group2',shade=True,bw=0.3,kernel='biw',
                   gridsize=None,clip=(-4,4),cut=3)
    
    elif len(clust.clusters.value_counts()) == 3:
        
        sns.kdeplot(g1[columnname],label='Group1',shade=True,bw=0.3,kernel='biw',
                    gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g2[columnname],label='Group2',shade=True,bw=0.3,kernel='biw',
                   gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g3[columnname],label='Group3',shade=True,bw=0.3,kernel='biw',
                   gridsize=None,clip=(-4,4),cut=3)
    
    
    else:
        sns.kdeplot(g1[columnname],label='Group1',shade=True,bw=0.3,kernel='biw',
                    gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g2[columnname],label='Group2',shade=True,bw=0.3,kernel='biw',
                   gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g3[columnname],label='Group3',shade=True,bw=0.3,kernel='biw',
                   gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g4[columnname],label='Group4',shade=True,bw=0.3,kernel='biw',
                    gridsize=None,clip=(-4,4),cut=3)
        sns.kdeplot(g5[columnname],label='Group5',shade=True,bw=0.3,kernel='biw',
                   gridsize=None,clip=(-4,4),cut=3)    
    
    
    plt.ylabel('Probability distribution frequency')
    plt.xlabel(f'{columnname} values')
    plt.title(f'Distribution of {columnname} for each group')
    plt.legend()
    return plt.show()
In [185]:
def binomial_dist(columnname):
    sns.set_palette(palette='nipy_spectral')    
    #figure
    plt.figure(figsize= (8,4))
    
    if len(clust.clusters.value_counts()) == 2:
        sns.kdeplot(g1[columnname],label='Group1',shade=True,bw=0.2,kernel='gau')
        sns.kdeplot(g2[columnname],label='Group2',shade=True,bw=0.2,kernel='gau')


    elif len(clust.clusters.value_counts()) == 3:
        sns.kdeplot(g1[columnname],label='Group1',shade=True,bw=0.3,kernel='gau')
        sns.kdeplot(g2[columnname],label='Group2',shade=True,bw=0.3,kernel='gau')
        sns.kdeplot(g3[columnname],label='Group3',shade=True,bw=0.3,kernel='gau')
        
    else:
        sns.kdeplot(g1[columnname],label='Group1',shade=True,bw=0.1,kernel='gau')
        sns.kdeplot(g2[columnname],label='Group2',shade=True,bw=0.1,kernel='gau')
        sns.kdeplot(g3[columnname],label='Group3',shade=True,bw=0.1,kernel='gau')   
        sns.kdeplot(g4[columnname],label='Group4',shade=True,bw=0.1,kernel='gau')
        sns.kdeplot(g5[columnname],label='Group5',shade=True,bw=0.1,kernel='gau')
   
    
    
    plt.ylabel('Probability distribution frequency')
    plt.xlabel(f'{columnname} values')
    plt.title(f'Distribution of {columnname} for each group')
    plt.legend()    
        
        
        
    return plt.show()

2 Group Clusters

In [186]:
#load the model ready data again to remove the scaling 
clust = pd.read_csv('modeldata.csv')
In [187]:
#create 2 cluster df
clust['clusters'] = pred_2 #use the 2 cluster predictions as a new column in the dataset
km2 = clust.copy()
In [188]:
#group the data by the clusters and show the mean of the customers in each
GROUPS2 = km2.groupby('clusters').mean().transpose()
#create temporary names for the 2 groups, show the first few rows
GROUPS2.columns = ['Group1','Group2']
GROUPS2.head()  
Out[188]:
Group1 Group2
SeniorCitizen 0.101584 0.227743
Dependents 0.448389 0.138421
tenure 48.036319 15.404022
InternetService 0.928181 0.813073
OnlineSecurity 1.149372 0.400769
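One caveat with this mean-only summary is that it hides how many customers sit in each cluster. Both can be pulled from the same labels; a sketch on a hypothetical mini-frame shaped like km2:

```python
import pandas as pd

# hypothetical mini-frame mimicking km2's layout (features plus a 'clusters' label)
df = pd.DataFrame({'tenure': [48, 50, 12, 18, 20],
                   'SeniorCitizen': [0, 0, 1, 0, 1],
                   'clusters': [0, 0, 1, 1, 1]})

sizes = df['clusters'].value_counts().sort_index()   # customers per cluster
means = df.groupby('clusters').mean().transpose()    # per-cluster feature means
means.columns = [f'Group{i + 1}' for i in means.columns]

print(sizes.to_dict())
print(means)
```

Running `km2['clusters'].value_counts()` alongside the GROUPS2 table would show whether either group is too small to act on.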

From this snippet of the data we can see that the model has identified 2 customer types: an older group that is short tenured and a younger group that is longer tenured.

I will evaluate them further before defining characteristics.

In [189]:
g1 = km2.loc[km2['clusters']==0]
g2 = km2.loc[km2['clusters']==1]
In [190]:
import statsmodels as sm
from statsmodels.nonparametric import kernels,bandwidths,kernel_density,kde,kdetools

#test the density estimate used for the binomial kde
plt.figure()
densityfunc = kde.kdensity(g1['TotalServices'],kernel='gau',bw=0.2,gridsize=None)
plt.plot(densityfunc[1],densityfunc[0])

#for the boolean kde
plt.figure()
densityfunc = kde.kdensity(g1['noprotection'],kernel='biw',bw=0.2,gridsize=None)
plt.plot(densityfunc[1],densityfunc[0])


#use these density values for creating the binomial and boolean dist
Out[190]:
[<matplotlib.lines.Line2D at 0x1a2b22f5c0>]
Evaluating the features using density plots
In [191]:
# 3 continuous features:
group_distribution('MonthlyCharges')
group_distribution('TotalCharges')
group_distribution('tenure')

Probability Density Distributions of the continuous features in the model:

Monthly Charges:

Group 2 has 3 modes, with a slight skew to the left. Group 1 also has 3 modes, at 20, 60 and 110, and the 20 and 110 modes are very distinct. The mode spanning 60 to 80 holds the majority of values in Group 1. The data is distributed across the full range of Monthly Charges.

Total Charges:

Group 2 has a heavily skewed distribution, piled up near 0 with a long tail to the right. Group 1 is spread across the entire range, but definitely has a large number of points near the low values.

Tenure:

Group 2 is again skewed towards 0. Group 1 is made up of mostly long-tenured customers, though its distribution carries more low-tenure customers than Group 2 carries high-tenure ones.

The distributions below are for binomial and categorical values, so I used a cumulative distribution, which is a more appropriate way to visualize a discrete variable. To read these plots, each change in y is equivalent to the probability frequency of that x value.
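The cumulative view can also be computed directly: the empirical CDF steps up at each distinct value by that value's relative frequency, which is exactly the 'change in y' reading described above. A small sketch on a made-up discrete sample:

```python
import numpy as np

# made-up discrete sample (e.g. a count of services per customer)
sample = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 4])

values, counts = np.unique(sample, return_counts=True)
ecdf = np.cumsum(counts) / counts.sum()

# each jump in the ECDF equals the relative frequency of that value
for v, c, e in zip(values, counts, ecdf):
    print(v, c / len(sample), round(e, 2))
```

The same computation on a column like `g1['fiberint']` would give the exact jump heights that the shaded plots only show approximately.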

In [192]:
# The created values that affect customer value negatively
boul_dist('fiberint')
boul_dist('noprotection')
boul_dist('paperelect')
#the 2 positive values 
boul_dist('customerengage')
boul_dist('youngfam')

These plots tell us the following about each group:

Group 2 contains the majority of customers that choose fiber internet, though about 40% do not. These customers are also split roughly 60/40 on noprotection (0,1) and on choosing paper billing paid by electronic check (0,1). In short, this group accounts for a lot of the negative customer traits.

Group 1 customers, on the other hand, do not use fiber internet 60% of the time, while 40% do. A very small section of these customers choose the noprotection services or the paper billing + electronic check payments. Group 1 also consists of highly engaged customers, and a majority of them have families.

In [194]:
#show the Total Services and customer value binomial distributions, churn
binomial_dist('TotalServices')
binomial_dist('customervalue')
boul_dist('Churn')

It looks like for Total Services, both Group 1 and Group 2 have a lot of customers with only 1 service. However, Group 1 has a more normal-looking distribution across all the values compared to Group 2.

Based on what we saw on the positive and negative created features, this outcome is not out of character from our analysis. Group 1 has more high value customers in our data.

We will be predicting Churn, and based on these clusters the model does well at identifying the high-value group (Group 1), which is clearly associated with a Churn of 0.

With 2 groupings, I don't think there is enough detail available to make valid strategic decisions. Instead, I will try the largest number of groupings under consideration (5).

5 Group Clusters

In [195]:
#load the model ready data again to remove the scaling 
clust = pd.read_csv('modeldata.csv')
In [196]:
#create 5 cluster df
clust['clusters'] = pred_5 #use the 5 cluster predictions as a new column in the dataset
km5 = clust.copy()
In [197]:
#group the data by the clusters and show the mean of the customers in each
GROUPS5 = km5.groupby('clusters').mean().transpose()
#create temporary names for the 5 groups, show the first few rows
GROUPS5.columns = ['Group1','Group2','Group3','Group4','Group5']
GROUPS5.head()  
Out[197]:
Group1 Group2 Group3 Group4 Group5
SeniorCitizen 0.156697 0.088300 0.304949 0.249399 0.033443
Dependents 0.409623 0.331126 0.140466 0.155573 0.421639
tenure 59.652796 27.895511 26.465793 12.314354 30.566557
InternetService 0.588427 0.035320 0.949054 0.674419 2.000000
OnlineSecurity 1.276983 1.066961 0.451965 0.003208 1.000000

With 2 clusters, all we could deduce from this snippet was 2 customer types. Here we see 5:

There are 2 groups (3 & 4) that are older but do not have many families within them. Another 2 groups (2 & 5) are significantly younger and have a large portion of customers with families. The last group, Group 1, is middle-aged, has families, and has the longest tenure of them all.

In [198]:
g1 = km5.loc[km5['clusters']==0]
g2 = km5.loc[km5['clusters']==1]
g3 = km5.loc[km5['clusters']==2]
g4 = km5.loc[km5['clusters']==3]
g5 = km5.loc[km5['clusters']==4]
In [199]:
#look at the 3 numerical columns first:
group_distribution('MonthlyCharges')
group_distribution('TotalCharges')
group_distribution('tenure')

Evaluating these five groupings is neither visually pleasing nor does it provide much useful analysis. Based on what we know, there are two pairs of similar groupings and one unique group, so using 3 clusters should be more valuable.

3 Group Clusters

In [200]:
#load the model ready data again to remove the scaling 
clust = pd.read_csv('modeldata.csv')

#create 3 cluster df
clust['clusters'] = pred_3 #use the 3 cluster predictions as a new column in the dataset
km3 = clust.copy()

#group the data by the clusters and show the mean of the customers in each
GROUPS3 = km3.groupby('clusters').mean().transpose()
#create temporary names for 3 groups, show the first few rows
GROUPS3.columns = ['Group1','Group2','Group3']
GROUPS3.head()  
Out[200]:
Group1 Group2 Group3
SeniorCitizen 0.037818 0.178341 0.279543
Dependents 0.418392 0.356682 0.120712
tenure 28.892516 55.054839 15.223634
InternetService 1.234076 0.603687 0.736129
OnlineSecurity 1.032643 1.125346 0.223634

Unlike the 5 cluster model, the 3 cluster groupings are a lot simpler and more generalized, because each group is unique.

Group 1 is the youngest, has the most dependents, and is mid-tenured compared to Groups 2 and 3.

Group 2 is middle-aged, with a significantly longer tenure than Groups 1 and 3. This group also likely has some young/older families.

Group 3 is the oldest of the clusters and also the shortest tenured, with the fewest dependents.

In [201]:
g1 = km3.loc[km3['clusters']==0]
g2 = km3.loc[km3['clusters']==1]
g3 = km3.loc[km3['clusters']==2]
In [202]:
# continous variable plots
group_distribution('MonthlyCharges')
group_distribution('TotalCharges')
group_distribution('tenure')

Based on these plots, Group 1 (the youngest group) also has the lowest monthly and total charges. However, these customers can range from a short tenure to a very long one.

Group 2 is the opposite of Group 1 in all characteristics.

Group 3 is actually the shortest tenured, and is similar to Group 1 in total charges. Where they diverge is in the monthly charges: Group 3's distribution has 3 modes, and the most significant of the three is around $80, at the more costly end of the spectrum.

In [203]:
#created values
boul_dist('fiberint')
boul_dist('noprotection')
boul_dist('paperelect')
boul_dist('customerengage')
boul_dist('youngfam')

Of the three negative values, Group 1 is actually significantly better than Groups 2 and 3. These customers tend not to have fiber internet, very few of them go without the protection plans, and almost none of these clients use paper billing/electronic check payments. This makes sense because in our earlier EDA we found that younger customers were less likely to partake in any of these. Group 1 is close to evenly split on the 2 positive values (customer engagement and young families).

Group 2 is likely going to be the mid tier customer, they tend not to partake in the negative value services/features and have a very engaged customer base.

Group 3 is the worst of the three: its customers are split roughly 50/50 on the three negative values, do not engage with the company, and have a negligible number of young families.

In [204]:
#show the Total Services and customer value binomial distributions, boul Churn 
binomial_dist('TotalServices')
binomial_dist('customervalue')
boul_dist('Churn')

Based on what was expected from the analysis of the created rows, Group 1 is our highest valued customers, Group 2 is the second highest and Group 3 consists of all the low ranking clients. Group 1&2 also have very little churn from their customers.

Cluster Group Decision:

After looking at the 2, 3 and 5 cluster groupings, it is clear that the 3 cluster model gives us enough valuable information about each group: the groups are all unique and each covers a different 'type' of customer.

The 3 cluster model was the most valuable at grouping customer 'types'. It produced 3 distinct customer profiles, with different characteristics that can be used for further marketing and UX analysis.
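Once the 3-cluster model is the keeper, the fitted pipeline can tag new customers as they arrive via predict, because the scaler statistics and the centroids are frozen after fitting. A sketch on stand-in data (the notebook's pipe3 would be used the same way):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# stand-in for the customer matrix used to fit pipe3
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
pipe.fit(X_demo)

# a 'new customer' row: scaled with the stored scaler, assigned to the nearest centroid
new_label = pipe.predict(X_demo[:1])[0]
print(new_label)  # one of the labels 0, 1 or 2
```

In practice, `pipe3.predict(new_rows)` would map incoming customers onto Group 1/2/3 so the retention playbook for that group can be applied immediately.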

Evaluation of customer clusters

In [234]:
#USE MODEL READY FOR CLUSTER EVAL
clust = pd.read_csv('modelready.csv')

#create 3 cluster df
clust['clusters'] = pred_3 #use the 3 cluster predictions as a new column in the dataset
km3 = clust.copy()

#group the data by the clusters and show the mean of the customers in each
GROUPS3 = km3.groupby('clusters').mean().transpose()
#create temporary names for 3 groups, show the first few rows
GROUPS3.columns = ['Group1','Group2','Group3']
GROUPS3
Out[234]:
Group1 Group2 Group3
SeniorCitizen 0.037818 0.178341 0.279543
Dependents 0.418392 0.356682 0.120712
tenure 28.892516 55.054839 15.223634
PhoneService 0.871815 0.929954 0.911902
InternetService 1.234076 0.603687 0.736129
OnlineSecurity 1.032643 1.125346 0.223634
OnlineBackup 0.940287 1.393548 0.422702
DeviceProtection 0.870223 1.498618 0.394748
TechSupport 0.977707 1.237788 0.199915
StreamingTV 0.750000 1.542857 0.723422
StreamingMovies 0.748408 1.557604 0.732740
PaperlessBilling 0.326433 0.690783 0.784413
PaymentMethod 1.783439 1.087558 1.799238
MonthlyCharges 33.725717 90.599355 74.035155
TotalCharges 947.803185 5003.945069 1193.215544
Churn 0.083201 0.128571 0.584922
customerengage 0.564490 0.776959 0.027107
youngfam 0.335589 0.304147 0.069039
paperelect 0.033838 0.200922 0.517154
fiberint 0.020701 0.603687 0.734435
noprotection 0.012739 0.004147 0.519271
TotalServices 1.711385 5.107834 2.257942
customervalue 3.832803 3.272350 1.325286

Executive Summary of Clusters

Characteristics of Group 1:
  1. Youngest group (almost no seniors within the group), lots of young families (good)
  2. Mid tenured, ~29 months
  3. ~33% use paperless billing; only ~3% partake in both paper bills and electronic check payment
  4. Highest customer value, 3.83
  5. On average customers have ~2 services with telco
  6. The lowest total and monthly charges of the 3 groups (likely because of the lower total services)
  7. Lowest churn, ~8%

Recommendation: This is a young group that may have only recently joined telco. It would be valuable to find out what made them join and how we can retain them. While they only use 1-2 services, they can develop into long-term clients with potential for growth.

Characteristics of Group 2:
  1. Middle-aged group with, yet again, lots of young families
  2. Excellent Customer value, 3.27
  3. Longest Tenured customers, ~55 months
  4. Majority use paperless billing, but most do not combine it with electronic check payments
  5. Use the most services, ~5 on average
  6. On average the highest total charges as well as monthly charges
  7. Highlight - Customer Engagement is 77%
  8. Churn is low among the group, though higher than Group 1's, at ~13%

Recommendation: I would say this is the group that is truly the 'best' customer. Even though Group 1 has a higher customer value and lower churn, those customers are not as long tenured. Group 2 customers also bring in the most revenue and are engaged with the company, so a plan to ensure they re-sign contracts would be priority 1.

Characteristics of Group 3:
  1. Oldest group, with very few families
  2. Shortest tenure group
  3. Majority have fiber internet
  4. Most likely to use paper billing and electronic check
  5. Least likely to pay for any protection plan (bad)
  6. Very low customer engagement
  7. Highest churn, 58%

Recommendation: This group includes a lot of senior customers, and our EDA showed that seniors are likely to churn. Some further evaluation is still warranted, because the group also contains a lot of single/non-family customers.

Based on the characteristics, it's clear that Group 3 is a problem and those particular characteristics will likely lead to churn. Clustering is very helpful when possible, especially on a dataset like this one, where the clusters are easy to interpret.

The findings from the clusters can be used to start strategic planning for marketing or client-retention teams to learn more about the people within Group 3.
